Author: Joachim Plath
setwd("C:/Users/Joachim/Documents/BC/Atombombe")
library(RTutor)
ps.name = "understanding bank runs"
sol.file = paste0(ps.name,"_sol.Rmd")
libs = NULL
# character vector of all packages you load in the problem set
libs = c("foreign","reshape2","plyr","dplyr","mfx","ggplot2","knitr","regtools","ggthemes","dplyrExtras","grid","gridExtra","prettyR")
name.rmd.chunks(sol.file, only.empty.chunks=FALSE)
# Create problem set
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL, libs=libs,
          extra.code.file = "extracode.r", var.txt.file = "variables.txt")
# show in web browser
show.ps(ps.name, load.sav=FALSE, launch.browser=TRUE, sample.solution=TRUE, is.solved=!TRUE)
This problem set analyzes factors leading to bank runs.
It is developed from the following paper: "Understanding Bank Runs: The Importance of
Depositor-Bank Relationships and Networks", written by Rajkamal Iyer and Manju Puri.
You can download the paper from nber.org/papers/w14280
to get more detailed information.
The dataset and the Stata code can be downloaded from aeaweb.org/articles.php?doi=10.1257/aer.102.4.1414
Overview:
How to use RTutor and introduction to the issue.
Descriptive and Inductive Statistics:
Summary statistics for the whole dataset and sub groups.
General Overview and the Impact of an Insurance Cover:
Model introduction and a first probit regression.
Stata vs. R:
Differences between Stata and R: how to deal with perfect prediction.
Relation between a Loan and the Insurance Cover:
Do all depositors who are above the insurance cover run?
Importance of Bank-Depositor Relationship:
How can the bank-depositor relation be assessed?
Influence of Social Networks:
How much influence does a network have on the running decision?
Robustness Check:
Checks whether the findings are dependent on some omitted factors.
Conclusion
The first exercise is an introduction to RTutor, in order to help you learn how to deal with this problem set.
Furthermore, we will define a bank run and take a look at how the definition is reflected in our underlying dataset.
Before you start to use the RTutor html version, you need to be familiar with the interface.
In the problem set you have to solve one code chunk after the other. Consequently, you have to start with the first task of an exercise and continue step by step until its last task. Note, however, that you can work on the exercises themselves in any order, so you can choose which one you want to work on.
If you click on one of the numbered buttons on top of the page, you can skip to the related exercise. If you click on the Data Explorer
button, you will get an overview of all loaded data.
All your commands have to be written into the white fields and can be checked for correctness by clicking on the check
button.
The directions are always mentioned in the Task.
Sometimes you'll need further information to solve a task, which is always given in an info-block. In order to see the whole information, you need to click on the highlighted info-block.
The other buttons above the code field are explained in the first exercise.
Also keep in mind that:
- Previous hints will always be highlighted in italics
- functions(), packages or variables will always be highlighted in red
At the beginning of each exercise, all required data will be loaded, because RTutor doesn't recognize variables from previous exercises by default. Moreover, this gives you a better overview of which dataset is used in which exercise. Some chapters in this problem set refer to a certain part or table in the replicated paper. References to this paper are attached in brackets behind the heading of the exercise.
For this exercise, you will have to return to the homepage where you've downloaded this problem set.
Download the dataset data_for_transaction_accounts.dat
into your current working directory.
Task: Use the command read.table()
in order to read the downloaded dataset. Subsequently store it into dat_trans
.
When you're finished, click on the check
button. If you need further help, click on the hint
button, which provides you with more detailed information.
The command read.table()
reads a file in table format and creates a data frame out of it.
If you set the working directory correctly, you only need to type the name of the dataset in quotation marks:
read.table("data_for_transaction_accounts.dat")
Otherwise, you have to set the full path. E.g.:
read.table("C:/mypath/data_for_transaction_accounts.dat")
In order to store your results, proceed as follows:
data=read.table("data_for_transaction_accounts.dat")
Storing and saving your results in a variable mean the same thing. Make sure that you always proceed as shown above.
The check
button evaluates your code and checks if the answer is correct. If you check an incorrect command, a hint is automatically given. If you need help, press the hint
button in order to get additional information. The hint may give you parts of the solution or suggestions about some critical issues.
The run chunk
button processes your code but does not check it. This can be useful if you want to inspect the output of your commands or get some additional information.
The data
button shows the datasets loaded in the current exercise. To get an overview of all loaded datasets, click on the Data Explorer
button at the top of the page.
dat_trans=read.table("data_for_transaction_accounts.dat")
#< hint
display("Just type: dat_trans=read.table(\"data_for_transaction_accounts.dat\") and press check!")
#>
To get an overview of the data click on the data
-button, which shows you a description of the single variables respective to the column titles.
The dataset of dat_trans
contains all depositors who have a transaction account at the headquarters of a bank which faced a run on March 13, 2001. The bank was located in the state of Gujarat, India. The precipitating event was the default of the largest cooperative bank in the state of Gujarat. The bank had neither any inter-bank exposures to the defaulted bank nor any stock investments. Consequently, we can assume that the bank was solvent and healthy at the time when it faced the run. Moreover, the state economy was performing well. Given this scenario, we can safely assume that the run was the result of an idiosyncratic shock, so we can focus on the behavior of the depositors.
For additional information about the collected data, look at part II and III in the mentioned paper.
The phenomenon where depositors rush to withdraw their deposits because they believe the bank will fail is called a bank run. However, how much a depositor needs to withdraw in order to be counted as a runner remains open in this definition. According to Iyer and Puri, a runner is a depositor who withdraws more than 75% of his or her deposits on March 13, 2001.
The running behavior can be measured with the variables runner75
, runner50
and runner25
. In order to get a first impression of these variables, click on the data
button of the last code field and take a look at the related columns.
As you can recognize, these variables are all binary coded. This means that they take either the value of one or zero. These variables indicate whether a depositor withdraws more than 75%, 50% or 25%, respectively. To understand to what extent the definition of a runner depends on the withdrawal threshold, we compute the sum of these columns.
The following command shows you how to compute the sum of the column runner75
.
This time you only have to click on the check
button.
#< task
sum(dat_trans$runner75)
#>
Similar, we compute the sum of runner50
. Only press check
.
#< task
sum(dat_trans$runner50)
#>
Now it's your turn:
Task: Use sum
to compute the sum of the column runner25
of the dataset dat_trans
.
sum(dat_trans$runner25)
#< hint
display("Proceed as in the previous examples. You just have to adjust the column names.")
#>
As you can see, the calculated sums decrease as the withdrawal threshold increases. This is easy to explain: the depositors who withdraw more than 75% are a subset of those who withdraw more than 50%, who are in turn a subset of those who withdraw more than 25% of their deposits.
To calculate all these numbers within one command, we can use the summarise_each
function out of the dplyr
package.
For the next task, it is recommended to look at the given info block, which is given below.
The summarise_each
function is part of the dplyr
package. It applies one or more functions to the mentioned columns of a given dataset. Also, it recognizes data, which is already grouped and calculates the given functions separately for each group.
For example, if you want to calculate the sum of the variables runner75
and runner50
, which are part of the dataset dat_trans
, use:
library(dplyr)
sum_wide=summarise_each(dat_trans,funs(sum),runner75,runner50)
sum_wide
Note: whenever we say "show the output", you have to type the assigned variable into the last line. If you click on check
, RTutor evaluates all your commands and also shows the output.
If you want to learn more about how to use a certain function, it is useful to read the related pdf-file. You can quickly find them if you google the requested function. In our case you can look at:
cran.r-project.org/web/packages/dplyr/dplyr.pdf
Task: Make use of the summarise_each
function to compute the sum
of the variables runner75
, runner50
and runner25
which are part of the dataset dat_trans
. Store your result in sum_wide
.
Finally show your results, typing sum_wide
into the last line.
Previous hint: Look at the info-block to see how to use summarise_each
. Don't delete the given command, it's part of the solution.
#< task
library(dplyr)
#>
sum_wide=summarise_each(dat_trans,funs(sum),runner75,runner50,runner25)
sum_wide
#< hint
display("Only add runner25 to the given example in the info-block!")
#>
#< add_to_hint
display("Just use: sum_wide=summarise_each(dat_trans,funs(sum),runner75,...) and type sum_wide into the next line.")
#>
Now, we want to plot our results using the ggplot
function. This function needs a data-frame in the long format.
To get this long format, we use the melt
command out of the reshape2
package.
As you see, there is no task, so only click on the check
button.
melt()
can be applied to transform data into the shape that we need in order to plot graphs. melt(data,id.vars,measure.vars,variable.name,value.name)
creates a data frame based on the given id.vars
. Other variables that you want to keep in the new dataset must be given by measure.vars
. From these specifications, a data frame is created with the same number of id-columns as given in id.vars
. Additionally, there is one column each for the name of the measured variable and its value.
For further reading take a look at: had.co.nz/reshape/introduction.pdf
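The following minimal sketch (with a made-up data frame, not part of the problem set) illustrates what melt() does:
library(reshape2)
# hypothetical wide data: one row of sums per group
wide = data.frame(group = c("a","b"), runner75 = c(3,1), runner50 = c(5,2))
# every non-id column becomes a (variable, value) pair
long = melt(wide, id.vars = "group", measure.vars = c("runner75","runner50"),
            variable.name = "variable", value.name = "value")
long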
#< task
library(reshape2)
sum_long=melt(sum_wide)
sum_long
#>
Compare the variables sum_long
and sum_wide
. sum_long
has only two columns: one for the value of each column and one for the name of each column. Now this data structure can be used to plot the different sums with ggplot
.
ggplot
is a function of the package ggplot2
, which is an implementation of the so-called grammar of graphics. Commands that generate a plot all follow the same structure. The basic command ggplot
is extended by various components, which are added with the +
operator.
In our case, we use the basic command ggplot(data,aes(x,y,fill))
, which needs a dataset that contains the data we want to plot. aes
specifies the aesthetic mappings, which are passed to the plot elements. Here we need a categorical variable for the x-axis and a continuous one for the y-axis.
As mentioned above, you can add various geometries to a plot with functions starting with geom_
, using the +
operator. All available geometries and additional functions are listed at the ggplot2 webpage: docs.ggplot2.org/current/index.html
For a good introduction, look at: noamross.net/blog/2012/10/5/ggplot-introduction.html
If you've prepared your data mydata
with the melt
command, you are now able to use ggplot
as explained below in order to get a bar-graph:
library(ggplot2)
plot=ggplot(mydata,aes(x=variable,y=value,fill=variable))+
  geom_bar(stat="identity")
# you only need to type the variable into a new line to display the plot after you've pressed the check button
plot
The term "identity" means that the bars represent the values in the data. If you use "bin" instead, the bar-graph shows the number of cases for each occurring x. In that case you are not allowed to map y to a specific column.
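A tiny, self-contained illustration of the difference (the data frames here are made up for this example):
library(ggplot2)
pre = data.frame(variable = c("a","b"), value = c(3,1))   # precomputed sums
raw = data.frame(variable = c("a","a","a","b"))           # raw cases
# stat="identity": bar heights are taken directly from the value column
ggplot(pre, aes(x = variable, y = value)) + geom_bar(stat = "identity")
# the counting stat (called "bin" in older ggplot2 versions, "count" in newer ones):
# bars show how often each x occurs, so no y aesthetic is mapped
ggplot(raw, aes(x = variable)) + geom_bar()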
It is also possible to add elements to your plot later on, as long as you have stored your plot before.
plot=plot+ggtitle("MyHeading")
plot
Task: Create a bar-graph applying the ggplot
command. Make sure, that you use the variable
-column of sum_long
as x-axis and the value
-column as y-axis. Further set fill=variable
.
Don't forget to store your result in the variable plot1
and show your graph.
Previous hint: Take a look at the ggplot info-block! The needed package is already loaded for you.
#< task
library(ggplot2)
#>
plot1=ggplot(sum_long,aes(x=variable,y=value,fill=variable))+
  geom_bar(stat="identity")
plot1
#< hint
display("Create the graph as described in the info block, but instead of mydata use sum_long. Show the graph by typing plot1 into last line.")
#>
#< add_to_hint
display("Just type plot1=ggplot(sum_long,aes(x=variable,y=value,fill=variable))+ geom_bar(stat=\"identity\"). Finally, type plot1 into the last line.")
#>
The graph shows you the sum of all runners depending on the threshold. With an exact value of 307, the number of depositors who withdraw more than 75% of their deposits seems to be very small compared to the 10691 observations in our dataset.
The difference between the sums of runner25
and runner75
shows that most of the runners withdraw more than 75%. We could interpret the level of withdrawals as a measure of panic: the more a depositor withdraws, the more panicked he or she is. Therefore, most of the people who withdraw seem to be driven by panic. Even though the percentage of runners according to the 75% threshold is only 2.87%, this goes hand in hand with the fact that even a small fraction of depositors can cause a bank run. These numbers are quite similar to other bank runs: e.g. the run on the IndyMac bank was caused by less than 5% of the depositors.
To get a better understanding of the ggplot graphs, we want to polish our plot. The graph is still missing an explanation of what we see. Moreover, the label of the y-axis should be "sum" instead of "value".
Task: Set a heading by adding ggtitle("Number of Runners depending on the running level\n")
to your existing plot using the +
operator.
Make sure that you don't forget to store your results again in plot1
and show plot1
afterwards.
plot1=plot1+ggtitle("Number of Runners depending on the Running Level\n")
plot1
#< hint
display("Look at the second code example of the infoblock \"ggplot\"!")
#>
#< add_to_hint
display("To create the plot, you only need to type plot1 +ggtitle(\"Number of Runners depending on the Running Level\n\"). Don't forget to store and show your results!")
#>
The "\n" at the end of the heading creates a newline after the heading. This makes the plot less squeezed.
Task: Label the y-axis of plot1
. To do so, add ylab("Sum of Runners")
with the +
operator to plot1
. Show the plot immediately and don't store your results.
plot1+ylab("Sum of Runners")
#< hint
display("Just look at the info-block of ggplot!")
#>
After getting more familiar with the term bank run, we now want to take a closer look at our dataset. We further want to examine factors which influence the running decision.
As mentioned in the introduction, we will load the dataset which is the basis of our analysis.
These loadings will be done automatically by first clicking on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
#>
In this part, we want to understand the structure of the underlying data. The structure is easier to understand if we visualize some key characteristics of the data. To get a first overview, we are going to compute summary statistics containing the mean, the standard deviation, and the number of observations that do not contain NAs.
Task: Apply the describe
function to your dataset dat_trans
. Set num.desc=c("mean","sd","valid.n")
.
Previous hint: Since the required package is loaded, you only write your command into the subsequent lines.
describe
needs a data frame as a first input parameter. It computes the measures given by num.desc
for each column. This functionality looks similar to the sapply
function, which also applies given functions to all columns of the dataset. Indeed, the describe function internally calls sapply
.
A code example of how to use it is provided below:
describe(dat_trans,num.desc=c("mean","sd","valid.n"))
For additional parameters and other summary commands, take a look at the pdf-file of the prettyR
package: cran.r-project.org/web/packages/prettyR/prettyR.pdf
NA is a logical constant of length one, which contains a missing value indicator.
It simply indicates that there is no entry for the selected item.
describe
accounts for the NAs if you use valid.n
#< task
library(prettyR)
#>
describe(dat_trans,num.desc=c("mean","sd","valid.n"))
#< hint
display("Just copy the example of the info-block. You don't need to store your results.")
#>
#< add_to_hint
display("Just type: describe(dat_trans,num.desc=c(\"mean\",\"sd\",\"valid.n\"))")
#>
If we now want to interpret these results, we have to bear in mind what each variable describes and how it is scaled. Some variables are transformed from their original meaning in order to obtain estimates which can be interpreted more easily. For example, the opening balance on the day of the run is counted in hundreds of Rs.; therefore, the average opening balance was Rs. 3259. Some of the variables don't make sense to interpret but are shown because we don't want to inflate the problem set with select
-commands.
E.g. the address is simply a number, which can be set in various ways and can't be interpreted.
Now that we have a rough overview, we need to think about how the different variables affect the running behavior, which is the core of our analysis. To accomplish that, we divide our observations into runners and stayers according to the 75% threshold.
To do that, we add a new column to our dataset, called type
to which we assign the value runner
if runner75
equals one and stayer
if runner75
equals zero. This will make our commands easier and the legend of our plots will be more intuitive. As this task is already accomplished for you, you only need to click on the check
button.
#< task
dat_trans$type=ifelse(dat_trans$runner75==1,"runner","stayer")
#>
Task: Use the group_by
command out of the dplyr
-package to group dat_trans
by type
.
Don't forget to store your result in grouped_dat
.
group_by()
is part of the dplyr
package. It takes the data and converts it into grouped data. The grouping should be done by categorical variables and can involve multiple variables. Subsequent operations on the data will then be carried out on the grouped data.
An example will show you how to use the command:
library(dplyr)
# group data by only one column
group_dat=group_by(dat_trans,type)
#< task
library(dplyr)
#>
grouped_dat=group_by(dat_trans,type)
#< hint
display("This command has only two input-parameters: dat_trans and type. Don't forget to store your computations.")
#>
#< add_to_hint
display("Only type: grouped_dat=group_by(dat_trans,type).")
#>
In the next step, we want to visualize the means of the groups. We don't want to look at the whole dataset, because taking the means of some variables doesn't make sense, as shown in the example of the adress
variable. Therefore, we only take a subset consisting of: minority_dummy
, above_insurance
, loanlink
, avg_deposit_chng
, avg_withdraw_chng
, opening_balance
, ln_accountage
, avg_transaction
. All of these variables are candidates for having an impact on the running decision. For the economic reasoning behind the selected variables, take a look at the following info-block.
When analyzing bank runs, we have to think about economically reasonable factors which could drive depositors to run. Deposit insurance is a widely used instrument to prevent bank runs. For example, the US raised the insurance limit from $100,000 to $250,000 during the financial crisis. Consequently, we take the deposit insurance into account using the variable:
above_insurance
The relation between bank and depositor could also affect a run. The more intensive this relation is, the more information a depositor can gain about the health of the bank. This relation is measured by a set of variables:
- loan_linkage
- ln_accountage
- transactions
Another factor, which could lead the depositor to run, is herd-behavior. We measure this phenomenon by defining a minority. As most of the people in India are Hindu, Muslims are defined as the minority. We measure how the affiliation to a certain group affects the running decision through:
- minority_dummy
The amount of money which a depositor has in his account on the day of the run is another crucial factor. We have already accounted for the insurance cover effect; therefore, we only look at the balance if the amount is smaller than the insured cover. This leads to the variable:
- opening_balance
Task: Apply the summarise_each()
function to calculate the mean
for each of the following variables:
minority_dummy
, above_insurance
, loanlink
, avg_deposit_chng
, avg_withdraw_chng
, opening_balance
, ln_accountage
and avg_transaction
.
Make sure that you don't forget to save your result into the variable mean_wide
and show the output.
Previous hint: This time you can see a part of the command displayed in green. Delete all of the # in front of the commands and complete these given commands.
#< task_notest
# Only replace the ??? with the mentioned function and delete the #s
# mean_wide=???(grouped_dat,funs(mean),minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
# mean_wide
#>
mean_wide=summarise_each(grouped_dat,funs(mean),minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
mean_wide
#< hint
display("Only replace the ??? with summarise_each.")
#>
Our aim is to visualize the calculated means using the ggplot
-function. Remember that ggplot
needs one column for the x-axis which should be categorical and representative for the different variable names and one for the y-axis, in our case the calculated means.
Task: Use the melt()
command to melt mean_wide
with "type"
as id-variable. Remember to store your command in mean_long
for further purposes and show your results.
#< task_notest
# Only adapt the ??? in the command below and delete the #s.
# mean_long=melt(mean_wide,id="???")
# mean_long
#>
mean_long=melt(mean_wide,id="type")
mean_long
#< hint
display("Proceed as in Exercise 1.3!")
#>
#< add_to_hint
display("Did you forget to put the id-variable in quotation marks?")
#>
The format of the returned tables looks very similar to the tables of exercise 2. All columns that contain numerical values are transformed into a single column, with the former column title as row label. Furthermore, we have now set an id-variable, which is displayed in the first column for every value of the non-id variables. The length of the table depends on the number of groups: $\text{length} = \#\text{groups} \cdot \#\text{columns}$, where $\#$ denotes the respective number.
In the next step, we want to visualize our results to get a better understanding of the different characteristics of the groups. We are especially looking for variables that have discrimination power, which means that the difference between runners and stayers is large.
For this purpose, we draw a bar-graph, which you need to refine later on.
For this time, you only need to click on the check
button.
#< task
# this is the basis command
plot2=ggplot(mean_long,aes(x=variable,y=value,fill=type))+
  # you need position_dodge() to draw the bars beside each other
  geom_bar(stat="identity",position=position_dodge())+
  # -> info-block
  geom_text(aes(ymax=0,y=value/2,label=round(value,3)),position = position_dodge(width=1),vjust=-0.25,size=3)+
  # -> info-block
  facet_wrap(~variable,scale="free")+
  xlab("")+
  ylab("")+
  ggtitle("Grouped Means\n")
plot2
#>
Faceting partitions a plot into a matrix of panels. facet_wrap(~variable)
creates a single panel for each variable. This is useful if you have data with different scales. Therefore, you should not compare the different panels but the plots within the panels. If you further add scale="free", the scales are adjusted and each panel has its own scale.
For additional information, take a look at:
sape.inf.usi.ch/quick-reference/ggplot2/facet
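A small sketch (with made-up data, for illustration only) of how facet_wrap() separates variables that live on very different scales:
library(ggplot2)
d = data.frame(variable = rep(c("a","b"), each = 2),
               type = rep(c("runner","stayer"), 2),
               value = c(0.1, 0.2, 100, 300))
ggplot(d, aes(x = type, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free")   # one panel per variable, each with its own scale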
geom_text
is useful if you want to add labels to your graph. It needs the position defined by aes(x=positionx,y=positiony,label=value)
and the label, which is drawn into your plot.
A general help for different variations of the ggplot
functions can be found at:
docs.ggplot2.org/current/geom_text.html
Before we start interpreting our result, take a look at the plot above. There you can recognize that each panel is labeled twice: At the top and at the bottom. For this reason, we delete the labels of the x-axis.
Task: Display plot2
and make use of the command scale_x_discrete(breaks=NULL)
using the +
operator in order to delete the labels of the x-axis.
plot2+scale_x_discrete(breaks=NULL)
#< add_to_hint
display("Just type plot2+scale_x_discrete(breaks=NULL) to get the graph.")
#>
Now we turn to interpreting the plotted bars:
Remember that we search for variables which have a large power of discrimination. Regarding the decision to run, the above_insurance
variable has the largest impact: the fraction of depositors who are above the insured Rs. 100,000 is nearly 20 times higher in the runner group. This striking difference can be explained easily: the amount above the insurance cover is at stake in case of a default of the bank. Consequently, a rational depositor should run if he or she is above the cover. Nevertheless, we see that 0.7% of the stayers are above the cover as well. If we find an explanation for this kind of behavior, we might be able to keep depositors who are above the insurance cover from running.
We also recognize that the deposit balance (opening_balance
) is much higher for runners than for stayers on the day of the run. This phenomenon is consistent with our explanation of the insurance cover.
In a nutshell: the more we have, the more we can lose.
This pattern also confirms that even a small number of runners can have a large impact on the solvency of the bank, provided that the runners are rich enough.
Another factor which has a huge impact on stayers but relatively little effect on runners is the loanlink
variable. A depositor who has an outstanding loan at the bank will have more contact with the bank staff than a depositor who only stores his money at the bank. Through this relation, he or she might gain information which strengthens his or her opinion about the health of the bank.
After taking a closer look at the calculated means, we further want to check how significant these differences are. This validation can be done through a two-sample t-test. In this case, we conduct a t-test with different standard deviations and unpaired samples. A large t-statistic provides evidence against the null hypothesis.
Consider two unpaired samples $(X_{11},...,X_{1N_{1}})$ and $(X_{21},...,X_{2N_{2}})$. Unpaired means, that not only the single observations within the samples, but also the samples themselves are independent.
The two-sided two-sample t-test checks whether the difference of the means $\mu_{1}$ and $\mu_{2}$ of the two samples is unequal to 0. It assumes normally distributed and independent observations, which enables us to write:
$$
H_{0}: \left | \mu_{1}-\mu_{2} \right |= 0 \; \; \; \; \; \; vs. \; \; \; \; \; \; H_{1}: \left | \mu_{1}-\mu_{2} \right |\neq 0
$$
In case of unknown standard deviations the test-statistic is calculated as follows:
$$
t=\frac{\bar{x}_{1}-\bar{x}_{2}}{\sqrt{\frac{\hat{\sigma}^2_{1}}{N_{1}} + \frac{\hat{\sigma}^2_{2}}{N_{2}}}}
$$ where $\hat{\sigma}^2$ denotes the estimator of the variance and $\bar{x}$ denotes the estimator for the expectation.
The test-statistic then is approximately Student-t-distributed.
For further reading, take a look at: Greene (Econometric Analysis, 2008) - Chapter 16, p.500-502 Estimation Methodology
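As a cross-check (not part of the replication), the same kind of Welch two-sample statistic can be obtained with base R's t.test(); this assumes dat_trans has been read in as above:
# Welch two-sample t-test (unequal variances, unpaired), e.g. for opening_balance grouped by runner75
t.test(opening_balance ~ runner75, data = dat_trans, var.equal = FALSE)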
To conduct a t-test, use the sapply()
function and then apply a subset of dat_trans
to the function.
We use the function TTest()
, which shows the elements of the t-test in a compact format.
This time you only need to take a look at the command, but later on you'll do this task on your own.
Therefore, we subset the variables measured in our bar-graph before.
You don't need to compute this time, just click on check
.
The select(data,col1,col2)
command is part of the dplyr
-package, which extracts certain columns out of the given dataset.
It thus returns a subset of the original dataset.
#< task
# we overwrite the select function since it is defined in several packages
select <- dplyr::select
subset1=select(dat_trans,runner75,minority_dummy,above_insurance,loanlink,avg_deposit_chng,avg_withdraw_chng,opening_balance,ln_accountage,avg_transaction)
#>
The next step is conducting the test. This time you don't need to type in the right command. Only press the check
-button. But please acknowledge that you have to do it on your own in the fourth exercise.
sapply()
is a function which is very useful for data manipulation. It has two input parameters: sapply(data,FUN)
where data should be a data frame or a list and FUN a function that is applied to each column of data. The function returns one result per column of the input data.
For further information, look at
ats.ucla.edu/stat/r/library/advanced_function_r.htm
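A minimal, self-contained illustration of sapply() on a toy data frame:
sapply(data.frame(a = 1:3, b = 4:6), mean)
# returns one result per column: a = 2, b = 5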
The function TTest
has two parameters of which the first one is the sample to be grouped; the second one is the grouping variable. To avoid inconsistencies, the parameters should be vectors from the same dataset. For visualizing purpose, we only want to show the estimated mean, the p-value and the t-statistic.
#< task
t(sapply(subset1[-subset1$runner75],function(x) round(TTest(x,subset1$runner75),3)))
#>
By looking at the p-values in the bottom table, we see that all variables except minority_dummy
and avg_deposit_chng
have significant differences between runners and stayers at the 1% level. A p-value below 1% means that, if the variable did not systematically differ between runners and stayers in reality, the probability of finding differences as extreme as (or more extreme than) in our sample would be below 1%. The size of the t-statistic depends on the difference of the means and on the standard deviations. From the bar graph and the given statistics, we can verify this statement by looking at the above_insurance
variable and the opening_balance
variable: the heights of the related bars are so different that we can assume a large t-statistic, if the standard deviations are not too big. Indeed, the standard deviation of above_insurance
is smaller than one, which increases the t-statistic.
These findings strengthen our guess, that the selected variables have an impact on the running decision.
The rejection area is monotonically increasing in the significance level: the larger the significance level, the larger the rejection area. The p-value is the largest significance level for which the null hypothesis that the mean difference equals 0 is not rejected (in case of a two-sided test). Rejection occurs when the test statistic is larger than the related quantile. Consequently, one must reject the null hypothesis if the significance level is larger than the p-value. Therefore, a small p-value is evidence for the alternative hypothesis. As a rule of thumb, the p-value is about 5% if the t-statistic is around +/- 1.96.
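To illustrate the rule of thumb, the two-sided p-value belonging to a given statistic can be approximated with the standard normal distribution (a reasonable approximation for the large samples used here):
t.stat = 1.96
2 * (1 - pnorm(abs(t.stat)))   # roughly 0.05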
After we have made up a general overview of the data, we have to go one step further and think in a little more abstract ways about the running behavior. A depositor runs because he thinks that the bank will go insolvent. His or her opinion can be influenced by two different sources. Firstly, the information he or she has about the health of the bank and secondly, the information he or she got from the behavior of others. We will examine these two sources of information, starting with the personal information.
As in Exercise 2, we will load the dataset on which we base our analysis.
These loadings will be done automatically, so that you need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
#>
First of all, we have to think about an appropriate model. Bear in mind that we want to model the running behavior, which is expressed through the binary variable runner75
. This leads us to the so called probit approach, which models the running probability p with the standard normal distribution: $\mathbb{P}(runner75=1|x)=\Phi (x^{\top }\beta)$.
Think about a model, which is appropriate to measure the impact of some factors on the running decision. Bear in mind that the decision to run is binary coded through the variable runner75
. We therefore need a model, which links the decision to run to a set of factors, like we do it in a regression. Our approach will be to analyze each of these factors in the framework of probability models:
$$ Prob(depositor \; runs)=Prob(runner75=1)=F(relevant \; \mathit{effects},\mathit{parameters}) $$
We assume that multiple factors x explain the decision to run. Therefore we can write:
$$ \mathbb{P}(runner75=1|x)=F(x,\beta ) $$ $$ \mathbb{P}(runner75=0|x)=1-F(x,\beta) $$
The set of parameters $\beta$ reflects the impact of changes in $x$ on the probability.
The only thing we need to find is an appropriate model for the right hand side of the equation.
A first thought may be to retain a linear regression model:
$$
F(x,\beta )=x^{\top }\cdot \beta
$$
One problem is, that this function isn't constrained to the 0-1 interval.
Our requirement is a model, which produces predictions that are consistent with the following thoughts:
The model should predict a high running probability for the depositors that
run and a low running probability for the depositors that stayed.
The standard-normal distribution fulfills all our requirements and is therefore an appropriate link-function $F$.
We therefore are able to write:
$$
F(x,\beta)= \Phi (x^{\top }\cdot \beta )=\int_{-\infty }^{x^{\top }\cdot \beta } \phi (t)\, dt= \int_{-\infty }^{x^{\top }\beta } \frac{1}{\sqrt{2\pi }} \exp\left(-\frac{1}{2} t^2\right) dt
$$
We call F a link function, because it links the linear combination $x^{\top}\beta$ of the factors to a probability.
Our estimates of $\beta$ will be based on the method of maximum likelihood. Each draw of runner75
is treated as an independent draw of a Bernoulli distribution.
The likelihood-function for a sample of N depositors, can be written as:
$$ L=\prod_{i=1}^{N} F(x_{i},\beta )^{runner75_{i}}\cdot \left(1-F(x_{i},\beta )\right)^{1-runner75_{i}} $$
This common density is maximized with respect to $\beta$, which leads us to the problem that the first-order condition can't be solved analytically. Therefore, we use Newton's method, which usually converges to the maximum of the likelihood in just a few iterations.
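The following sketch (not needed for the problem set) illustrates the idea numerically: it maximizes the probit log-likelihood for a single explanatory variable with the general-purpose optimizer optim() and compares the result to glm(). It assumes dat_trans has been read in as above.
# keep only complete cases for the two variables used here
cc = complete.cases(dat_trans[, c("runner75","above_insurance")])
y = dat_trans$runner75[cc]
x = dat_trans$above_insurance[cc]
# Bernoulli log-likelihood with probit link: sum of y*log(Phi(x'b)) + (1-y)*log(1-Phi(x'b))
loglik = function(beta) {
  p = pnorm(beta[1] + beta[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
# optim() minimizes, hence the sign flip; the result should be close to glm()'s coefficients
opt = optim(c(0, 0), function(b) -loglik(b))
opt$par
coef(glm(runner75 ~ above_insurance, family = binomial(link = "probit"), data = dat_trans))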
The interested reader should take a look at the following book for further reading: Greene (Econometric Analysis, 2008) - Chapter 23, p. 771 ff. Models for Discrete Choice
Task: Use the glm()
command to regress runner75
against: minority_dummy
, above_insurance
, opening_balance
, loanlink
, ln_accountage
, avg_transaction
, avg_deposit_chng
, and avg_withdraw_chng
from the dataset dat_trans
. Don't forget to store the regression output in the variable reg1
.
Previous hint: Further you can delete all the # before the given command and then adapt it!
glm()
is used to fit generalized linear models. If we want to compute a probit regression, we have to set family=binomial(link="probit")
. The following example explains it best:
reg=glm(runner75~minority_dummy+above_insurance+opening_balance,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
The formula notation has the following meaning: regress runner75 on a linear combination of minority_dummy, above_insurance and opening_balance. All the variables mentioned in the formula must be columns of the data frame dat_trans. As link function, the standard normal distribution function is used. na.action
is an option, which decides how to deal with NA's. If we use na.action=na.omit
, all observations in which NAs occur are dropped.
#< task_notest
# Delete all the # and insert the needed data-frame for the ???
# reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=???,na.action=na.omit)
#>
reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
#< hint
display("Just copy the formula from the example, only delete the ??? and replace it with dat_trans from Exercise 1.")
#>
#< add_to_hint
display("Just type: reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link=\"probit\"),data=dat_trans,na.action=na.omit)")
#>
In order to get a better understanding of the influence of single variables, we want to show the marginal effects instead of the coefficients, which are reported by default by the glm()
function. Furthermore, we want to compute robust standard errors to get a more reliable level of significance.
All these features can be computed with the showreg()
command.
The function showreg
is a useful function to visualize the most common statistics of several regression-outputs.
It can be best explained by the command below:
showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
digits sets the number of digits after the decimal point that should be shown, and omit.coef drops coefficients whose names match the given pattern; several patterns can be combined with the | operator. In order not to overwhelm you, the showreg() command is computed for you.
In the following computation, we take care of the mentioned features. Further, we don't show the intercept, and for more clarity we round all our results to three decimal places.
The only thing you have to do is to click on the check
button.
#< task
library(regtools)
showreg(list(reg1),robust=c(TRUE),robust.type="HC0",coef.transform=c("mfx"),digits=3,omit.coef="(Intercept)")
#>
To interpret each variable, we first describe the output in general. The marginal effect and the p-value represented by stars are reported in the first row. You can see how the p-value and the stars are related at the bottom of the table. In the second row, the robust standard errors are shown in parentheses.
The first two rows at the bottom of the table show measures of the relative quality of the statistical model: the popular AIC and BIC. Interpreting these measures only makes sense if we have another model on the same dataset to which we can compare them. The third row indicates the log-likelihood, i.e. the maximized common log density at the estimated coefficients. Because the common density is restricted to the [0,1] interval, its logarithm is always negative. The fourth row shows the deviance of the model, which is also a measure for comparing models on the same dataset; it measures the goodness-of-fit of the ML-estimated model compared to the null model. The last row displays the number of observations.
A first result is the effect of the insurance cover. If a depositor is above the insurance cover, the probability of a run increases by 32.9 percentage points. Furthermore, this result has a relatively small standard error, which explains the high level of significance. This supports the conclusion that deposit insurance reduces depositors' panic. But if we take a closer look at depositors below the insurance cover, a rise in the opening_balance
seems to increase the likelihood of running. Even though these depositors are below the insurance cover, there are some who decide to run.
Second, we recognize that the depositor-bank relationship matters. The length of this relation is measured by ln_accountage
, which is highly significant. The depth of the relationship is measured by the loanlink
variable, which has the third largest influence on the probability of running. Both of these variables have a negative marginal effect, which means that the larger they are, the smaller is the probability of a run.
Marginal effects can be calculated in various ways, but generally there are two common methods of computing them.
Note that the marginal effects are calculated differently for continuous and binary variables.
Consider the probit model with $k$ explanatory variables $x$: $\Phi (\alpha +\sum_{i=1}^{k} \beta_{i}x_{i})$
1. The first method is called marginal effect at the mean (MEM).
- For continuous variables $x_{j}$, the MEM is given by:
$$
MEM_{j}=\beta_{j}\,\phi \left(\alpha+\sum_{i=1}^{k}\beta_{i}\bar{x}_{i}\right)
$$
- For binary variables $x_{j}$, the MEM is:
$$
MEM_{j}=\Phi\left(\alpha+\sum_{i=1}^{k} \beta_{i}\bar{x}_{i} \,\middle|\, x_{j}=1\right) - \Phi\left(\alpha+\sum_{i=1}^{k} \beta_{i}\bar{x}_{i} \,\middle|\, x_{j}=0\right)
$$
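As an illustration (not part of the replication), the MEM of a continuous variable can be computed by hand from reg1; packages such as mfx use closely related formulas. This assumes reg1 has been estimated as above:
X = model.matrix(reg1)                 # design matrix including the intercept column
b = coef(reg1)
xb.bar = sum(b * colMeans(X))          # index x'beta evaluated at the variable means
b["opening_balance"] * dnorm(xb.bar)   # MEM_j = beta_j * phi(index at the means)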
Conventionally, we have the following relationship between the stars and the p-value:
- * : p-value below 5%
- ** : p-value below 1%
- *** : p-value below 0.1%
These values are calculated on the basis of a two-sided t-test for the coefficients. The test statistic for a coefficient $\beta_{i}$ is computed as follows: $t_{i}=\frac{\beta_{i}}{\sqrt{Var(\beta_{i})}}$, where $t_{i}$ is Student-t distributed.
One problem, which occurs if we want to interpret the effect of continuous variables on the running probability, is that the marginal change of a continuous variable is hard to imagine. Therefore, we will now take a look at a so called effectplot
, which calculates changes in the running probability in a more intuitive way.
Task: Use the raw formula of the effectplot()
-function as described in the info block and plug in reg1
.
Previous hint: Adapt the given code. Therefore delete the # before the command to use the code.
The function effectplot(reg=myreg,numeric.effect="10-90")
is part of the regtools
-package. The first parameter is a regression object, such as one returned by glm()
. The second input is the underlying dataset, which should be a data frame. The last parameter is of type string and contains the quantiles which are plugged into the probability function.
It helps to compare the magnitudes of the influence of different explanatory variables. The default effect is "10-90", i.e. the effect of changing a (numeric) explanatory variable -ceteris paribus- from its 10%- to its 90%-quantile.
The following code examples explain some of the parameters of the function in more detail:
# basic call with the default quantile effect
effectplot(reg1,numeric.effect="10-90")
# by default, add.numbers is set to TRUE; to hide the numbers on the bars:
effectplot(reg1,numeric.effect="10-90",add.numbers=FALSE)
# set a heading (the \n is used not to squeeze the plot)
effectplot(reg1,numeric.effect="10-90",main="MyHeading\n")
# ignore certain explanatory variables
effectplot(reg1,numeric.effect="10-90",ignore.vars="above_insurance")
# show confidence intervals
effectplot(reg1,numeric.effect="10-90",show.ci=TRUE)
#< task_notest
# only replace the ??? with the mentioned regression
# effectplot(???,numeric.effect="10-90")
#>
effectplot(reg1,numeric.effect="10-90")
#< add_to_hint
display("Type: effectplot(reg1,numeric.effect=\"10-90\")")
#>
Finally, things are getting much clearer: the result shows the change in the running probability if we move from the 10%-quantile to the 90%-quantile of the related variable and set the other variables to their means. Thus, the effect of a change in a regarded variable becomes more intuitive than just looking at the derivative, which only reports marginal changes. For example, the interpretation of the effect of the opening_balance
is now more intuitive:
If a depositor's balance on the day of the run rises from Rs. 124 to Rs. 6330, his or her running probability rises by 0.75%.
We further see that the loanlink
and ln_accountage
are highlighted in red. These two factors are the only influences, which reduce the running probability.
In the next step, we want to extend our calculated regression as in the replicated paper. We first want to include the variable travel_costs
and second control for the variable ward
.
This part deals with the problems, which occur if you want to replicate Stata regressions with R.
As usual, we load the needed data. We further need the regression from the last exercise.
Just click on edit
first and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
reg1=glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
#>
If we control for a variable, we want to check whether the controlled variable has an impact on the regression. For this purpose, one adds the variable one wants to control for to the regression formula and checks whether there are any striking differences in the other coefficients.
In our case, we will only control for the variable ward
. This variable is a discrete number ranging from 1 to 88. Every number represents one ward of the town where the bank was located.
To get a better measure of the impact of each ward, we create a dummy variable for each ward. This means that we introduce 87 binary variables and select one ward as the reference class.
If you are interested in reading more, visit stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables
Including travel_costs
in addition takes the possible influence of the distance into account.
One could argue that the farther away a depositor lives, the less inclined he or she is to run. He or she might also weigh the benefit against the costs of getting his or her money. This argumentation will accompany us in the next exercise.
Further we could imagine, that the running decision is dependent on a specific ward. Maybe some people in a certain ward do have better information than others and thus don't run. Also the behavior of runners in one ward could influence the other depositors in this ward. This effect is measured by the variable ward
.
Before we start to regress the dependent variable on our new set of variables, we have to prepare the original dataset. For a better measuring of the variable ward
's impact, we create dummy variables for each ward
.
Task: Apply the function factor()
to the column ward
of the dataset dat_trans
. Operate on the single column with the $
-operator.
Previous hint: In this task you transform the original dataset, thus store your results in dat_trans$ward
.
The function factor()
is used to encode a vector as a factor.
As single input it needs the vector to be factored.
You will need this formula to prepare a categorical variable in your dataset so that a later called regression formula like glm()
creates a dummy-variable for each category.
If you want to factorize a single column of your dataframe, use:
dat_trans$ward=factor(dat_trans$ward)
With the $
operator, you can select a single column out of a data-frame.
It can be best explained by a code example:
single_column=mydata$col1
The return value assigned to the variable single_column is of the type of the single column and not of the type of the whole data frame.
dat_trans$ward=factor(dat_trans$ward)
#< hint
display("Look at the \"factor\" info-block.")
#>
#< add_to_hint
display("Type: dat_trans$ward=factor(dat_trans$ward)")
#>
After the preparation we now want to conduct the regression, which leads us to the following problem:
Stata vs. R: Since we try to replicate the paper, we have now come to a crucial point. If you run a regression that includes the factorized ward variable in Stata, it gives you the following warning: ward17 != 0 predicts failure perfectly - 14 obs not used. What this means can be shown with the following code:
#< task
X=dat_trans # (1)
X$ward=as.factor(X$ward) # (2)
M=model.matrix(runner75~ward-1,X) # (3)
M=cbind(model.frame(runner75~ward,X)[1],M) # (4)
M=M[order(M[,1],decreasing=T),] # (5)
ex=M[,c("runner75","ward17")] # (6)
ex[ex$ward17==1,] # (7)
coef(glm(runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward,data=X,family=binomial(link="probit"),na.action=na.omit))["ward17"] # (8)
#>
If this code may look strange to you, it can be briefly explained:
(1): First we create a copy of the dataset dat_trans
.
(2)+(3): Then we factorize the ward variable and construct a dummy variable for each ward in the town, so that we can control for its effect on the other estimates.
(4): Next we construct a dataset consisting only out of the dependent variable and the dummy variables.
(5)+(6): We sort this dataset according to the dependent variable and extract the ward17
column and the dependent variable.
(7): Finally, we show only the cases where the ward17 dummy takes the value of 1. We see that the dependent variable is always 0 in these cases. We could say: if the ward17
dummy equals 1, it perfectly predicts runner75
to be zero.
(8): If we look at the estimated coefficient of the ward17
variable, we see that it is extremely large. A coefficient of -3.81 for a dummy variable means, that if the value of the dummy is one, the probability of a run is sharply decreasing.
Stata automatically drops all of these variables. Thus, to fully replicate the paper, we need a function which drops all the perfect predictors.
I wrote a function called binary.glm
, which does exactly what Stata does in case of perfect prediction: the perfectly predicting dummy variable is deleted, together with all observations for which it predicts the dependent variable perfectly. The output shows the names of the dropped variables.
If one wants to compute standard errors clustered at a variable later on, one has the option to set the input parameter clustervar1
.
To get all the explanatory variables plus the cluster variable in the underlying data frame of the regression, use model.frame()
.
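binary.glm() is a custom helper shipped with this problem set, so its internals are not shown here, but the kind of perfect prediction it screens for can be spotted with a simple cross table. A small sketch, assuming dat_trans is loaded and ward has been factorized as above:
# if one of the four cells is empty, the ward-17 dummy predicts runner75 perfectly
table(ward17 = dat_trans$ward == "17", runner75 = dat_trans$runner75)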
Task: Use the function binary.glm()
to regress runner75
on minority_dummy
, above_insurance
, opening_balance
, loanlink
, ln_accountage
, avg_transaction
, avg_deposit_chng
, avg_withdraw_chng
, ward
and travel_costs
. Also add the adress
variable as a cluster variable and display the dropped variables.
Store your command in the variable reg2
.
Previous hint: Delete the # before the green-inked command and operate on this command.
binary.glm(formula,link,data,clustervar1,show.drops)
has three obligatory parameters and two optional ones. The first parameter is used as in a general glm
formula. The second parameter is the link function of the binary regression and is of type string; it can be either "probit" or "logit". The data can be the original dataset, so one doesn't need to pass a subset of the data. clustervar1
is optional and gives the name of the variable in the dataset on which one later wants to cluster; its type is also string. The last parameter is of type boolean. If set to TRUE, the function prints all the perfect predictors.
We further give some code examples of binary.glm() to make the following tasks easier.
# full call with a cluster variable, showing the dropped variables
reg=binary.glm(runner75~minority_dummy+above_insurance+opening_balance,link="probit",data=dat_trans,clustervar="adress",show.drops=TRUE)
# without a cluster variable
reg=binary.glm(runner75~minority_dummy+above_insurance,link="probit",data=dat_trans,show.drops=TRUE)
# with show.drops=FALSE the dropped variables are not printed
reg=binary.glm(runner75~minority_dummy+above_insurance,link="probit",data=dat_trans,clustervar="adress",show.drops=FALSE)
#< task_notest
# This time a code example is given. You only need to adjust the ??? with the correct Boolean.
# reg2=binary.glm(formula=runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=???)
#>
reg2=binary.glm(formula=runner75~minority_dummy+above_insurance+opening_balance+loanlink+ln_accountage+avg_transaction+avg_deposit_chng+avg_withdraw_chng+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=TRUE)
#< hint
display("Very often you make typing faults. To avoid them, use the given command.")
#>
After having calculated and adjusted the regression, we now want to visualize our results.
It would be favorable to show both regressions in one table so that we can check if the marginal effects changed in the second regression.
Task: Now use the command showreg()
to get a summary table of your calculated regressions: reg1
and reg2
.
Previous hint: Proceed as in the given example.
#< task_notest
# Replace the ??? with the second regression computed above, to get the regression table:
# showreg(list(reg1,???),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#< hint
display("only use reg2 instead of the ???")
#>
#< add_to_hint
display("Use: showreg(list(reg1,reg2),robust=c(TRUE,TRUE),robust.type=\"HC0\",coef.transform=c(\"mfx\",\"mfx\"),digits=3,omit.coef=\"(Intercept)|ward\") ")
#>
We see that the differences of the marginal effects are very small, if we add the explanatory variables travel_costs
and ward
. This means that these two variables don't seem to change our findings. The significance levels don't change dramatically either. We could say that our results from reg1
are robust to these influences.
Out of the table we recognize two important factors:
1. The effect of the insurance cover on the running probability is the largest and also highly significant.
2. The negative effect of the loan linkage is the second largest, with a very small p-value
Think about an economic explanation for these findings. The effect of an insurance cover seems clear: if one is insured, there is no incentive to run. The impact of the loan linkage isn't as clear to us. Also, the relation between these two effects should be investigated more intensively. So, in the next subchapter we focus on the relation between these two influences.
Remember that the ML estimation tries to estimate the true density of the dependent variable, $p$. The Kullback-Leibler divergence is a measure of the difference between the true density $p$ and the estimated density $\hat{p}$. Intuitively, AIC and BIC try to overcome the difficulty of comparing an estimate $\hat{p}$ with the real $p$, because in real life we don't know the true model. Abstractly speaking, we want to measure the information loss produced by taking the estimated model instead of the true model. The better the model, the smaller this difference. It follows that the smaller the AIC or BIC, the better the model.
For a brief and general overview, you can look at: en.wikipedia.org/wiki/Akaike_information_criterion
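These quantities can also be extracted directly in R (a small illustration, assuming reg1 from the previous exercise):
AIC(reg1)      # Akaike information criterion
BIC(reg1)      # Bayesian information criterion
logLik(reg1)   # maximized log-likelihood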
Now we know that having a loan linkage decreases the running probability of a depositor.
On the other hand, we know that being above the insurance cover leads to a large increase in the running probability.
It seems that these two variables work in opposite directions.
Therefore it would be interesting to know, if a depositor who is above the insurance cover might not run if he had a loan relation. For this purpose, we introduce two variables: uninsured_rel
and uninsured_no_rel
.
Before you start, load the needed data. For this purpose, first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
org_dat=dat_trans
#>
If we want to examine the combined effect of having a loan linkage (or not) while being above the insurance cover, we introduce two binary coded variables (a sketch of how such dummies could be constructed follows below):
- uninsured_rel
: the depositor is above the insurance cover and has a loan linkage
- uninsured_no_rel
: as uninsured_rel
, but without a loan linkage
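The dataset already contains these two variables; the following sketch is only meant to illustrate how such interaction dummies could be built. It assumes that above_insurance and loanlink are 0/1 coded in dat_trans, which matches how we use them in the regressions (the _demo names are made up so that nothing in the dataset is overwritten):

# Hypothetical construction of the two interaction dummies (illustration only):
# uninsured and with a loan linkage
dat_trans$uninsured_rel_demo    = as.numeric(dat_trans$above_insurance==1 & dat_trans$loanlink==1)
# uninsured and without a loan linkage
dat_trans$uninsured_no_rel_demo = as.numeric(dat_trans$above_insurance==1 & dat_trans$loanlink==0)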
Task: Run a regression similar to reg1
:
- Use the binary.glm()
function
- Take the explanatory variables as in reg1
but replace above_insurance
with uninsured_no_rel
and uninsured_rel
and regress them on runner75
- Show the dropped variables.
- Store your results in reg3
#< task_notest
# Only adjust the ??? with the mentioned variables. Add them in the same order as mentioned!
#reg3=binary.glm(runner75~minority_dummy+???+???+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,link="probit",data=dat_trans,show.drops=TRUE)
#>

reg3=binary.glm(runner75~minority_dummy+uninsured_no_rel+uninsured_rel+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,link="probit",data=dat_trans,show.drops=TRUE)

#< hint
hint("Just copy the regression formula from reg1 and replace as mentioned. The other parameters are: link=\"probit\",data=dat_trans,show.drops=TRUE")
#>
Look at the output of the previous code chunk. The first entry shows that uninsured_rel
predicts runner75
=0 perfectly: whenever the variable uninsured_rel
takes the value one, the variable runner75
is always zero. In order to better understand this result, we compute the sum of runners for each possible combination of above_insurance
and loanlink
.
Just click on check
, to get the mentioned computations.
#< task
summarise <- dplyr::summarise
summarise(group_by(dat_trans, above_insurance, loanlink), num.runners = sum(runner75))
#>
From the first exercise, we know that we have 307 runners. These runners are grouped as follows: 259 depositors who are under the insurance cover and have no loan linkage run. Seven depositors with a loan linkage and under the insurance cover run. Of the depositors above the insurance cover, 41 run if they have no loan linkage. For depositors who are above the insurance cover and have a loan linkage, we get a surprising and interesting finding: there are no runners in this group at all. This highlights the importance of a loan for the running decision.
If we didn't drop the variable uninsured_rel
and estimated its coefficient, it would be unusually large in magnitude. In addition, the results of the paper could then not be replicated.
Click on check
, to validate the statement.
#< task
reg3.1=glm(runner75~minority_dummy+uninsured_no_rel+uninsured_rel+opening_balance+ln_accountage+loanlink+avg_transaction+avg_withdraw_chng+avg_deposit_chng,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)
coef(reg3.1)["uninsured_rel"]
#>
With a value of -2.34, the coefficient is very large in magnitude and shifts the running probability close to zero whenever the variable uninsured_rel
equals one. This large coefficient has its origin in the chosen estimation method and can be explained intuitively: the ML estimation maximizes the probability of the observed sample. If there is one variable that predicts the dependent variable perfectly, the likelihood can be increased most by scaling up this variable. For this reason the related coefficient is pushed as far as possible in the direction of perfect prediction.
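To see this behavior in isolation, consider a small, purely artificial example of perfect prediction (complete separation). The data below are made up for illustration and have nothing to do with the bank-run dataset:

# Artificial example: x separates y perfectly (y = 1 exactly when x = 1).
set.seed(1)
toy = data.frame(y = c(rep(0, 50), rep(1, 50)),
                 x = c(rep(0, 50), rep(1, 50)),
                 z = rnorm(100))
# glm will typically warn that fitted probabilities of 0 or 1 occurred;
# the coefficient of x is driven towards +/- infinity (a very large number in practice).
sep.fit = glm(y ~ x + z, family = binomial(link = "probit"), data = toy)
coef(sep.fit)["x"]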
Task: Use the raw function effectplot()
to visualize your estimates of reg3
. Set the heading to main="Change in running probability\n"
.
Previous hint: If this task seems to be too tricky, look at the info-block of effectplot
and copy the command. Afterwards, do your adjustments!
effectplot(reg3,main="Change in running probability\n")

#< hint
display("Did you forget to use the variable org_dat?")
#>

#< add_to_hint
display("Type: effectplot(reg3,main=\"Change in running probability\n\")")
#>
What you can see here is very telling:
the effect of uninsured_no_rel
shows that if a depositor has no loan linkage and is above the insurance cover, the running probability rises dramatically. This highlights the importance of the insurance cover, which remains the largest effect.
We drop the variable uninsured_no_rel
to visualize the effects of the other variables in more detail. Further, we show a 95% confidence interval for each effect. You only have to click on the check
-button to display the plot.
#< task
effectplot(reg3,ignore.vars="uninsured_no_rel",show.ci=TRUE)
#>
The smaller the confidence interval, the more precisely the effect is estimated. In general, a confidence interval at the 5% significance level tells us in which area the effect lies with 95% probability. For example, the estimated effect of the opening_balance
variable lies in a relatively small confidence interval, in the close neighborhood of 0.75%. This supports our view that the opening balance indeed has a significant effect on the running decision. If we look at the effect of avg_transaction
, the confidence interval ranges from a negative value up to a positive one. We cannot be sure whether the estimated effect is larger than zero, and therefore we don't judge it as an important factor for the running decision.
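As a reminder of where such an interval comes from: for an approximately normal estimator, the 95% confidence interval is the point estimate plus/minus roughly 1.96 standard errors. A minimal sketch, assuming reg3 behaves like a standard glm object (note that this interval is on the probit index scale, not on the marginal-effect scale shown by effectplot):

# 95% confidence interval for one coefficient, built from estimate and standard error.
est = summary(reg3)$coefficients["opening_balance", "Estimate"]
se  = summary(reg3)$coefficients["opening_balance", "Std. Error"]
c(lower = est - qnorm(0.975) * se,
  upper = est + qnorm(0.975) * se)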
As always, we load the dataset on which we base our analysis.
The loading will be done automatically, but the download itself has to be done manually. So download the dataset "data_for_survey.dat"
into your current working directory.
After that, you only need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)

# data for the second task
dat_survey=read.table("data_for_survey.dat")
#>
Now we have some very interesting findings: a loan relation does not only reduce the running probability of depositors in general (first regression), it also keeps the uninsured depositors away from running (third regression).
This may have three reasons:
1. Depositors think that their outstanding loan is offset by their deposits.
2. Depositors get information about the true health of the bank and thus don't run.
3. There may be some socio-economic reasoning, e.g. the wealth.
The first thought can be discarded because in India it isn't allowed to offset outstanding loans against deposits in case of a default. The second reason sounds very interesting and can be tested easily with our dataset.
In this sub-chapter we want to check the hypothesis that a loan relation is a source of information and therefore creates some information value. To do so, we look at depositors who had a loan before the bank run and at those who will have a loan in the future.
Therefore we introduce a set of new variables:
loanlink_before
, loanlink_current
and loanlink_after
.
Look at the description to get more information.
We will first run a regression without the variable loanlink_after
. In a second regression we include this variable and measure whether loanlink_after
has an effect on the coefficients of the other variables.
To check whether a loan is some kind of information source, recall that only a current loan or a loan in the past can be a source of information: normally, if one has a loan, one has to appear at the bank at certain points in time to talk to a loan officer or to renegotiate one's loan conditions.
A future loan does not have this feature of a current relation. Prospective borrowers only have to fulfil the mentioned obligations in the future (a sketch of how these three dummies could be derived from loan dates follows after the list). A future loan is measured by the variable:
- loanlink_after
Having a current outstanding loan is measured by:
- loanlink_current
A loan link in the past is measured through:
- loanlink_before
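The dataset already contains these three dummies. Purely as an illustration of the idea, the following sketch shows how they could be derived if one had a loan start and end date for each depositor. The columns loan_start and loan_end and their values are hypothetical and not part of the actual dataset; only the run date of March 13, 2001 comes from this problem set:

# Hypothetical loan records (made-up dates), just to illustrate the coding rule.
run_date = as.Date("2001-03-13")   # run date considered in this problem set
loans = data.frame(loan_start = as.Date(c("1999-05-01", "2000-11-20", "2001-06-01")),
                   loan_end   = as.Date(c("2000-02-01", NA,           NA)))
# loan that ended before the run:
loans$loanlink_before  = as.numeric(!is.na(loans$loan_end) & loans$loan_end < run_date)
# loan outstanding at the run date:
loans$loanlink_current = as.numeric(loans$loan_start <= run_date &
                                    (is.na(loans$loan_end) | loans$loan_end >= run_date))
# loan taken out only after the run:
loans$loanlink_after   = as.numeric(loans$loan_start > run_date)
loans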
Task: Run a regression similar to reg4
. Add first the variable loanlink_after
, then ward
and last travel_costs
. Further, set clustervar="adress"
and show.drops=FALSE
. Store your result in the variable reg5
.
Previous hint: You see that there is already a command in your chunk. This command is part of the solution and mustn't be deleted.
#< task
# The first regression is done for you, to avoid long typing.
reg4=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction,link="probit",data=dat_trans,show.drops=FALSE)
#>

#< task_notest
# Only replace the ??? with the mentioned variables. Add them in the mentioned order!
#reg5=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction+???+???+???,link="probit",data=dat_trans,clustervar="adress",show.drops=???)
#>

reg5=binary.glm(formula=runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+avg_withdraw_chng+avg_deposit_chng+avg_transaction+loanlink_after+ward+travel_costs,link="probit",data=dat_trans,clustervar="adress",show.drops=FALSE)
Task: Now use the showreg()
command to show the results of reg4
and reg5
:
Calculate robust standard errors according to HC0 and show the marginal effects.
Round to the 4th decimal place by setting digits=4
and don't show the Intercept and the ward dummies.
Previous hint: Look at the info-block of showreg
!
showreg(list(reg4,reg5),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=4,omit.coef="(Intercept)|ward")

#< hint
display("Remember that some of the input parameters of showreg need to be written in a vector, e.g.: robust=c(TRUE,TRUE). For further help, look at the info-block.")
#>

#< add_to_hint
display("Use: showreg(list(reg4,reg5),robust=c(TRUE,TRUE),robust.type=\"HC0\",coef.transform=c(\"mfx\",\"mfx\"),digits=4,omit.coef=\"(Intercept)|ward\")")
#>
What we see here confirms exactly our conjecture that a loan linkage has an information value: the effect of a future loan linkage (loanlink_after
) is very small and not significant, but the effects of the past loan (loanlink_before
) and the current loan (loanlink_current
) are larger and much more significant. We can conclude that a future loan has no influence on the decision to run or not to run, because a depositor doesn't gain any additional information out of a future relation.
The value of the information may come from the conversations of the loan officer with the depositor, or from the fact that the depositor has to go to the bank more often than other depositors without a loan and thus has more chances to get information about the bank's health.
Further, we can now explain the coefficient of the ln_accountage
variable: the older the relation between the bank and the depositor, the more information can be gained about the health of the bank. This leads to higher trust in the bank and keeps the depositor from running.
In the prevailing banking literature, the importance of the bank-depositor relationship is highlighted. For example, in Goldstein and Pauzner (2005), depositors receive noisy signals about the health of the bank. We can now add that depositors who had a loan at the bank receive more informative signals. A reason for this might be the interaction with the related loan officer. As Diamond and Dybvig (1983) point out, a bank run depends on the depositors' belief in the ability of the bank to make the promised payments. The trust in a bank might therefore be fostered through a loan, making this bad equilibrium less likely. Finally, we could conjecture that a depositor is afraid of losing a potential source of financing: depositors with a loan linkage might have less incentive to run in order not to risk the financing of future projects.
In this sub-chapter we check whether the last thought, that the running behavior is influenced by socio-economic factors, can explain the loan-relation effect. Therefore we need more detailed information about the depositors than we currently have. This detailed information can be gained through a survey containing a list of questions regarding the socio-economic background of a single depositor.
To get a representative sample, one has to choose the observations randomly. In our case, 100 depositors who withdrew from their transaction account and 300 depositors who didn't withdraw were selected. These depositors all belong to different households, so that there won't be correlations between the observations (no clustering is needed). Despite all this, only 282 depositors could be visited, because the interviewer didn't meet all of them on the day of the survey.
In the survey, the depositors had to make statements about their property: they were asked whether they own an apartment, car, bike or land. Depending on the answers, a variable wealth was constructed, which sums up the relative part of the named items that a depositor holds. One could assume that the more a depositor owns, the more he or she is harmed by a default of the bank and thus runs very early. Also the depositors' age and their education are taken into account. Education was measured as follows: high school, bachelor degree or master degree. One could argue: the better the depositor is educated, the more realistic is his or her estimate of the health of the bank. Moreover, the depositors were asked whether they hold stocks. This could be an indicator for a depositor to run because he or she faced large losses from stock investments. The other questions were similar to our first dataset from Exercise 1.1 b).
To keep the focus on the socio-economic background of a depositor, we select some variables of interest: the depositor's age, measured in years, the amount of stocks he or she holds, and his or her wealth. The wealth is measured as follows: the depositor is asked whether he or she has a bike, land or an apartment. Each of the three asset indicators is expressed as a share of the total amount of that asset held by the depositors in sum, and these three ratios are then added up to form the wealth variable. A small illustrative sketch of this construction is given below.
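This sketch shows only one possible reading of that construction; the indicator names has_bike, has_land and has_apartment are hypothetical, the data are made up, and the actual weighting used for the survey variable may differ:

# Illustrative wealth index: each asset indicator divided by the total number of
# depositors holding that asset, then summed over the three assets.
survey_demo = data.frame(has_bike      = c(1, 0, 1, 1),
                         has_land      = c(0, 0, 1, 0),
                         has_apartment = c(1, 1, 1, 0))
survey_demo$wealth_demo = with(survey_demo,
  has_bike / sum(has_bike) + has_land / sum(has_land) + has_apartment / sum(has_apartment))
survey_demo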
Task: Use the select()
function to extract the variables runner75
, stock
, age
, education
, wealth
, education_dummy1
and education_dummy2
out of dat_survey
and store it into subset2
.
Previous hint: In the last task of Exercise 1 you already did a similar command.
The select()
command is part of the dplyr
-package. It extracts certain columns out of the given dataset.
It thus returns a subset of the original dataset.
For the concrete use, we provide a code example.
select(mydata,col1,col2,col3)
In general, there are various ways to select subsets of data. A useful page in general is the QuickR page. In our specific case of data-selection, look at: statmethods.net/management/subset.html
subset2=select(dat_survey,runner75,stock,age,education,wealth,education_dummy1,education_dummy2)

#< hint
display("Look at the second exercise if you want further examples on how to use select.")
#>

#< add_to_hint
display("Only type: subset2=select(dat_survey,runner75,stock,age,education,wealth,education_dummy1,education_dummy2)")
#>
When we want to explain the running behavior with socio-economic factors, we have to think about the different characteristics of a single depositor and how they might affect his or her running decision.
We first look at the stocks. Stocks may indicate a potential liquidation pressure of deposits to offset losses resulting from the stock markets. This might be a reason for depositors with stocks to run immediately. This effect is measured by the variable:
- stock
Also the wealth of a depositor could matter.
This is measured by:
- wealth
The education of a depositor may give an indication of how well he or she can assess the situation. The more educated a depositor is, the more realistic should be his or her picture of the bank's health. One could argue that the better the education of a depositor, the higher the likelihood that he or she stays at home, because in our case the bank is solvent. We take this into account using the variables:
- education
- education_dummy1
- education_dummy2
Last, we take the age into our analysis. Imagine someone very old having only a few years left; knowing this, would he or she run? The age is accounted for by:
- age
Now after the loading, we group the observations into runners and stayers and check whether there are significant differences. We want to answer the question whether socio-economic reasons may influence the running decision. If there is an influence, there should be some discriminating power of these socio-economic factors.
Therefore, you will now develop the function called TTest
, which you already used before.
This function is used in combination with sapply(data, function(x))
. Recall that sapply
applies the given function to each of the columns of the underlying dataset.
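As a quick reminder of how sapply works column-wise, here is a toy example with R's built-in mtcars data, unrelated to our bank-run dataset:

# Apply a function to each column of a data frame; sapply returns a named vector.
sapply(mtcars[, 1:3], mean)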
Task:
Write a function called TTest. Type the commands step by step, as mentioned in the task here:
- to create the function, write: TTest=function(x,y) {
- in the next line, write: output=t.test(x~y)[c("estimate","p.value","statistic")]
- now make one list out of the fractions and write into a new line: output=unlist(output)
- the return value doesn't have to be marked, only type into the next line: round(output,3)
- finally, close your function by writing into the next line: }
TTest=function(x,y) {
  output=t.test(x~y)[c("estimate","p.value","statistic")]
  output=unlist(output)
  round(output,3)
}

#< hint
display("Did you choose output to store the results of the t.test? For the second step write output=unlist(output)")
#>
Task: Use the sapply()
function to perform a t-test. Group the input data on the variable runner75
.
As data input use: subset2[-subset2$runner75]
.
Previous hint: If you can't remember how to use the function, look at the last task of Exercise 2. Delete the # in front of the given code and then adjust it.
#< task_notest
# Only adapt the ???
# t(sapply(???,function(x) TTest(x,subset2$runner75)))
#>

t(sapply(subset2[-subset2$runner75],function(x) TTest(x,subset2$runner75)))

#< hint
display("Look at the last task of Exercise 2 to remember the syntax")
#>

#< add_to_hint
display("Just type: t(sapply(subset2[-subset2$runner75],function(x) TTest(x,subset2$runner75)))")
#>
The output shows that the group means are very similar for all variables. On average, there is thus no clear trend in either direction of the decision: runners as well as stayers seem to have the same socio-economic properties. Moreover, none of the variables is significant even at the 5% level. Putting all this together, it looks as if socio-economic factors don't have an impact on the run-stay decision.
To test this assumption more formally, we run a probit regression to measure the changes in the running probability in dependence on these factors.
Task: Use the binary.glm()
function to regress runner75
like in reg6
. Also add the variables wealth
, stock
and age
to the regression formula and store your results in the variable reg7
.
Previous hint: Just copy the command and then do the adjustments. Don't delete the given example, it's part of the solution.
#< task
# Copy the code below and then do your adjustments. Add the mentioned variables at the end of the regression formula in the given order!
reg6=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2,data=dat_survey,link="probit",show.drops=TRUE)
#>

reg7=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2+wealth+stock+age,data=dat_survey,link="probit",show.drops=TRUE)

#< hint
display("The regression formula is the first input parameter of binary.glm. So only add the variables in the order they were mentioned.")
#>

#< add_to_hint
display("Type: reg7=binary.glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_deposit_chng+avg_withdraw_chng+avg_transaction+education_dummy1+education_dummy2+wealth+stock+age,data=dat_survey,link=\"probit\",show.drops=TRUE) ")
#>
What we see is that loanlink
predicts the variable runner75=0
perfectly. This means that none of the surveyed depositors with a loan ran. Bear this in mind for our interpretation!
In order to not bore you by always typing in the same commands, we directly show the output.
So you only have to press check
.
#< task
showreg(list(reg6,reg7),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
It is remarkable that loanlink
predicts the behavior "staying at home" perfectly.
This underlines the importance of a loan linkage, which is thus independent of socio-economic factors.
Also, being above the insurance cover shifts the running probability by more than 60%, which is enormous. Moreover, this coefficient is highly significant.
Regarding the socio-economic factors, we observe the following:
Stock investments don't have a significant influence on the running probability, which means that the depositors' decision isn't due to a liquidity shock caused by stock losses.
Also the age, education and total wealth don't seem to influence the running probability, which makes our findings on the loan linkage and the insurance cover robust to controlling for age, wealth and education.
We now step back to Exercise 2, where we stated that the decision to run depends on the information a depositor has about the fundamentals of the bank. This information can be gained from internal sources, such as a direct relation to the bank, or through external sources, such as contacts with other depositors. So the decisions of other depositors could influence someone's decision whether to run or not. To measure the effects of social networks, we have to structure these external sources. First, we measure the so-called introducer network: a common requirement for banks in India is that a depositor who wants to open an account has to be introduced by a depositor who already has an account at the bank. The purpose of this requirement is to identify the new depositor, as India had no common social security number. We therefore assign all depositors who have the same introducer to one network. Second, we measure the neighborhood network by looking at the ward in which a depositor lives: all depositors living in the same ward have the same value of the ward variable. (A sketch of how such network variables could be computed is given right after this paragraph.)
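The variables social_runners and ward_runners used below are already part of the dataset. Just to illustrate the idea, the following sketch shows how one could compute, for each depositor, the number of other runners in his or her ward (assuming dat_trans is loaded as in the chunk below). The analogous computation for the introducer network would group by a hypothetical introducer-id column instead, and the actual variable in the dataset may be defined as a fraction rather than a count:

library(dplyr)
# For every depositor: number of runners in the same ward, excluding the depositor himself.
dat_demo = dat_trans %>%
  group_by(ward) %>%
  mutate(ward_runners_demo = sum(runner75) - runner75) %>%
  ungroup()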
We proceed as in every exercise and first of all load the dataset on which we base our analysis.
These loadings will be done automatically so that you only need to first click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
#>
In this sub-chapter, we try to find a pattern that shows the relation between runners and wards.
Maybe we can see that all runners come from a specific ward and could therefore assume that the running decision is influenced by the decisions of the depositors living in the same ward.
To get a first overview and an intuition for the idea that a network influences the decision to run, we do the following:
Task: Apply group_by
to the dataset dat_trans
. Group by the variable ward
and store your result in dat_ward
.
Previous hint: If you forget how to use the function, look at Exercise 2.
dat_ward=group_by(dat_trans,ward)

#< hint
display("Look at Exercise 2.3. The first info-block contains all information.")
#>
Task: Use the summarise
function to sum up the runners in each ward. Therefore set the first input parameter to dat_ward
. Store your results in the variable ward_runner
.
Previous hint: Delete all the # before the green inked code and work with the given commands.
#< task_notest
# replace the ??? with the mentioned function
# summarise <- dplyr::summarise
# ward_runner=summarise(???,SumRunner=sum(runner75))
#>

summarise <- dplyr::summarise
ward_runner=summarise(dat_ward,SumRunner=sum(runner75))

#< hint
display("Only type: ward_runner=summarise(group_by(dat_trans,ward),SumRunner=sum(runner75))")
#>
Now, after you've summed up the runners in each ward, we should think about how the depositor's location, measured by the ward, influences the running decision. Notice that the ward variable could be constructed as follows: the city is viewed from a bird's perspective, as a Cartesian coordinate system. This means that we divide the city into squares and give every square a number, starting from the top left to the bottom right. If we now observed many runners in one ward and some runners in the neighboring ward, we could assume that there is some information spreading around the ward, which affects people in the surrounding area.
Task: Use ggplot()
to draw a graph as in the example.
Use ward
as the x-axis and SumRunner
as y-axis.
Previous hint: Delete all the # before the commands and directly work with them.
#< task_notest
# Just replace the ??? with the mentioned variables.
# ggplot(ward_runner,aes(x=???,y=???,fill=factor(ward)))+
#   geom_bar(stat="identity")+
#   theme(legend.position="none")+
#   ggtitle("Sum of Runners in a certain Ward\n")
#>

ggplot(ward_runner,aes(x=ward,y=SumRunner,fill=factor(ward)))+
  geom_bar(stat="identity")+
  theme(legend.position="none")+
  ggtitle("Sum of Runners in a certain Ward\n")
If you look at the graph, you see that the runners are concentrated around the large bars. Each bar represents a ward and is shown in a different color. The shape roughly looks like a Gaussian curve with the respective extreme value as maximum.
This pattern reminds us of the following:
the propagation of somebody's information about the health of the bank, and of his running behavior, works like in the game "whisper down the lane". Someone at the start of the lane whispers a statement to his neighbor. The neighbor only understands half of the information and whispers it to his neighbor, who again understands only half of it, and so on.
At the end of the line, the information that arrives is very noisy.
That's why the direct neighbors of a ward are strongly influenced by the behavior of the runners in this ward. The further away we move, the less people are influenced by this behavior.
In a first step, we tried to get an overview of the wards and the runners within each ward.
After having found some interesting patterns, we now look in more detail at how the running probability is influenced by depositors of a certain network.
Therefore we run three regressions:
The first regression is run with the common explanatory variables plus, in addition, only social_runners
.
The second regression uses all common variables plus ward_runners
.
The last regression includes social_runners
, ward_runners
and the common variables used before.
Task: Make use of the glm()
function, to run the third mentioned regression.
Copy the regression formula from reg9
and only add social_runners
at the end of the regression formula.
Store your results in reg10
.
Previous hint: Just copy the command of regression reg9
and then do the adjustments. Don't delete the given example, it's part of the solution.
#< task
reg8=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+social_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)  # only with social_runners
reg9=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)  # only with ward_runners
#>

reg10=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction+social_runners,family=binomial(link="probit"),data=dat_trans,na.action=na.omit)

#< hint
display("The regression formula of reg9 is: runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction. Only add social_runners at the end of the formula.")
#>

#< add_to_hint
display("Type: reg10=glm(runner75~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+ward_runners+avg_deposit_chng+avg_withdraw_chng+avg_transaction+social_runners,family=binomial(link=\"probit\"),data=dat_trans,na.action=na.omit)")
#>
We first want to show our regression findings in a table to make them comparable.
Task: Use showreg()
to show all coefficients of the three regressions you calculated above.
Report the marginal effects with robust standard errors according to HC0 for all of the three regressions.
Round to the 5th decimal place by setting digits=5
and don't show the Intercept.
Previous hint: Only delete the # before the green inked command and then adapt it.
#< task_notest
# Only adapt the ???
# showreg(list(reg8,???,reg10),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=???,omit.coef="(Intercept)")
#>

showreg(list(reg8,reg9,reg10),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=5,omit.coef="(Intercept)")
In the first column, we see the estimation results of the regression where we additionally included only the social network. We see that the probability of a depositor running is increasing in the number of runners in the introducer network. Further, the coefficient of the social_runners
variable is the second largest, which highlights its importance. In column two, the regression, which additionally included only the neighborhood network, is displayed. Similar to the social network, a rise in the fraction of running neighbors increases the probability of a run. Moreover, this effect is the largest, even bigger than the effect of the deposit insurance.
In the third column, we take both network variables together and check for the effect on the other variables when we take these two influences together. Both effects are still significant and only decrease a bit.
Our analysis shouldn't end without checking whether our results are robust to certain influences. We thus need to think about factors that could have an influence on our recent findings. We will adapt our probit model according to these factors and check whether our findings remain the same.
Download the dataset data_for_term_deposit_accounts.dat
into your current working directory.
The dataset on which we base our analysis will be read automatically, so you only need to click on edit
and then on check
.
#< task
dat_trans=read.table("data_for_transaction_accounts.dat")
dat_trans$ward=factor(dat_trans$ward)
dat_term=read.table("data_for_term_deposit_accounts.dat")
#>
One could argue that our findings depend on the definition of a runner. This is indeed a reasonable objection. But remember our first bar graph, which showed how the sum of runners depends on the withdrawal level: there were no striking differences. The impact of these different definitions on our regression coefficients can be shown if we regress our explanatory variables on these different definitions of a runner.
Task: Copy the given command and only change the dependent variable from runner50
to runner25
.
Store your results in the variable reg12
.
Previous hint: Don't delete the given code. It's part of the solution and will also be tested if you click on the check
button.
#< task_notest
reg11=binary.glm(runner50~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)
#>

reg12=binary.glm(runner25~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)

#< hint
display("This time you have to copy the whole function call and not only the regression formula.")
#>

#< add_to_hint
display("Type: reg12=binary.glm(runner25~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link=\"probit\",show.drops=FALSE)")
#>
Further, one could argue that withdrawals do not only occur at a single point in time.
In our analysis we set the running date to March 13, 2001; earlier withdrawals are not taken into account.
Now we extend the period and define as a runner a depositor who withdraws between March 9 and March 13, 2001. During this period, the following occurred:
on the 9th of March the largest cooperative bank faced a bank run and became insolvent on March 13.
The variable runner75_extended
captures exactly the described effect.
This time you only have to click on the check
button:
#< task
reg13=binary.glm(runner75_extended~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink+avg_withdraw_chng+avg_deposit_chng+avg_transaction+ward,data=dat_trans,link="probit",show.drops=FALSE)
showreg(list(reg11,reg12,reg13),robust=c(TRUE,TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx","mfx"),digits=4,omit.coef="(Intercept)|ward")
#>
As can be seen from the table, the significance levels of e.g. the loan linkage don't change. We can say that our finding of a significant effect of a loan linkage is robust to the definition of a runner; the withdrawal level of the depositors doesn't matter. If we moreover extend the period in which a depositor can withdraw, we don't see any large change in the significance levels. This makes our findings also robust to the choice of the time period.
Arguing only with the significance level involves two aspects. The significance level here is derived from the t-statistic, which is the coefficient divided by the standard error of the coefficient. If the significance level is low, the t-statistic is large. This can be the case either because the coefficient is large (if the significance level increased, we could then say that the effect is larger with respect to the newly defined threshold) or because the standard error of the coefficient is small (which means that the coefficient is estimated very precisely). So if the significance level remains the same, you have to look at the coefficient and the standard error to see where this effect is coming from. A small sketch of this computation follows below.
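As a reminder of the arithmetic, here is a minimal sketch. The values of est and se are made up for illustration; in practice you would take a coefficient and its (robust) standard error from one of the regression tables. Since the probit model is estimated by ML, the statistic is usually treated as a z-statistic:

# z-statistic and two-sided p-value from a coefficient and its standard error.
est = -0.05   # hypothetical coefficient
se  =  0.02   # hypothetical standard error
z   = est / se
p   = 2 * pnorm(-abs(z))   # two-sided p-value under approximate normality
c(z = z, p.value = p)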
So far we have only looked at transaction accounts. But like other banks, our examined bank also offers term deposit accounts. The purpose of these accounts is long-term saving: one makes a contract to leave the money with the bank until a certain date. Usually the interest rate for such accounts is higher than for transaction accounts. If depositors want to withdraw their deposits before the contracted maturity, they don't get the full interest payments; only a fraction minus a penalty is paid. A depositor who has saved his or her money in term deposit accounts therefore faces liquidation costs, which may influence the decision to run. For this reason we look at term deposit accounts and transaction accounts separately.
We now show you the regression outcomes for each table produced in the previous exercises. The only thing to do is to download the dataset; the regressions and the related findings will be produced automatically.
The following subtasks are done automatically.
You only have to click on check
!
#< task_notest
reg14=glm(runner~minority_dummy+above_insurance+opening_balance+ln_accountage+loanlink+ln_maturity,family=binomial(link="probit"),data=dat_term,na.action=na.omit)
dat_term$ward=factor(dat_term$ward)
reg15=binary.glm(runner~minority_dummy+above_insurance+opening_balance+ln_accountage+loanlink+ln_maturity+ward+travel_costs,data=dat_term,link="probit",clustervar1="household_key",show.drops=TRUE)
showreg(list(reg14,reg15),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
We see that the effects are very similar to the findings in 3.3. The three main findings are:
- being above the insurance cover increases the running likelihood
- the higher the opening balance, the higher the running probability
- having a loan linkage and a long relation with the bank decreases the likelihood to run
The only concern seems to be the significance level of the minority_dummy
, which is lower than the corresponding one for the transaction accounts.
Furthermore, we see a variable called ln_maturity
whose sign is negative. This variable measures the distance in days to the contracted maturity. The sign of the coefficient seems intuitive: the further away a term deposit account is from its maturity, the higher the penalty to pay in case of a withdrawal.
#< task_notest
reg16=binary.glm(runner~minority_dummy+opening_balance+ln_accountage+loanlink+ln_maturity+uninsured_rel+uninsured_no_rel,data=dat_term,link="probit",show.drops=TRUE)
showreg(list(reg16),robust=c(TRUE),robust.type="HC0",coef.transform=c("mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
The results are in line with our findings in Exercise 5 for the transaction accounts:
- If a depositor is above the insurance cover and has a loan relation, he or she doesn't run.
- On the other hand, if a depositor is above the cover and has no loan relation, the running probability rises dramatically.
After having found that loan linkages significantly reduce the running probability, we now try to explain this effect.
#< task_notest
reg17=glm(runner~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+ln_maturity,data=dat_term,family=binomial(link="probit"),na.action=na.omit)
reg18=binary.glm(runner~minority_dummy+ln_accountage+above_insurance+opening_balance+loanlink_current+loanlink_before+loanlink_after+travel_costs+ward+ln_maturity,link="probit",data=dat_term,clustervar="household_key",show.drops=FALSE)
showreg(list(reg17,reg18),robust=c(TRUE,TRUE),robust.type="HC0",coef.transform=c("mfx","mfx"),digits=3,omit.coef="(Intercept)|ward")
#>
We get the same findings as in the regressions with the transaction accounts.
Especially the main effects of the loan linkage are quite similar:
- A future loan has no significant impact on the running decision
- A current loan has a negative impact and is highly significant
- A past loan also has a significant and negative influence
We conclude: The depositor-bank relationship may reveal information about the health of the bank and thus keeps the depositor away from running!
Finally, we want to recapitulate our analysis and summarize the most important findings. We find that the insurance cover is the most powerful way to keep a depositor from running: uninsured depositors have a much higher running probability than insured ones. While the insurance cover helps to mitigate a run, it is only partially effective. A second finding is that the length of the bank-depositor relationship and a past or outstanding loan are important factors that prevent the depositor from running. Now remember the third factor:
Final Task: Which factor has a significant impact on the running decision?
- "stocks"
- "age"
- "neighbor_runners"
Assign one of these factors to the variable answer
.
#< task_notest
# Just write one of the mentioned factors
answer="???"
#>
We saw that the more people in the depositor's network run, the more likely is the depositor to run.